Matrices and Data Frames

1 Missing Values

No matter how carefully we collect our data, there will always be situations where we don't know the value of a particular variable. For example, we might conduct a survey where we ask people 10 questions, and occasionally we forget to ask one, or people don't know the proper answer. We don't want values like this to enter into calculations, but we can't just eliminate them because then observations that have missing values won't "fit in" with the rest of the data.

In R, missing values are represented by the symbol NA. For example, suppose we have a vector of 9 values, but the fourth one is missing. We can enter a missing value by passing NA to the c function just as if it were a number (no quotes needed):

x = c(1,4,7,NA,12,19,15,21,20)

R will also recognize the unquoted string NA as a missing value when data is read from a file or URL.
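
As a minimal sketch of this behavior, the example below reads a small table from an inline string rather than a real file. The data and the names survey_text and survey are hypothetical, made up only for illustration; the text= argument of read.table stands in for a file name.

```r
# Hypothetical inline data standing in for a file; the unquoted NA marks
# a missing score for the second respondent.
survey_text = "id score
1 12
2 NA
3 7"

# read.table() treats the unquoted string NA as missing by default.
survey = read.table(text = survey_text, header = TRUE)

print(survey)
print(is.na(survey$score))
```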

Missing values are different from other values in R in two ways:

Any computation involving a missing value will return a missing value.
Unlike other quantities in R, we can't directly test to see if something is equal to a missing value with the equality operator (==). We must use the built-in is.na function, which will return TRUE if a value is missing and FALSE otherwise.

Here are some simple R statements that illustrate these points:

> x = c(1,4,7,NA,12,19,15,21,20)
> mean(x)
[1] NA
> x == NA
[1] NA NA NA NA NA NA NA NA NA

Fortunately, these problems are fairly easy to solve. In the first case, many functions (like mean, min, max, sd, quantile, etc.) accept an na.rm=TRUE argument, which tells the function to remove any missing values before performing the computation:

> mean(x,na.rm=TRUE)
[1] 12.375
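
The same argument works with the other summary functions mentioned above. The following is a small sketch using the same vector; the names largest, total and middle are ours, chosen only for illustration.

```r
x = c(1, 4, 7, NA, 12, 19, 15, 21, 20)

# Without na.rm=TRUE the missing value propagates into the result.
print(max(x))

# With na.rm=TRUE the NA is dropped first, leaving 8 values.
largest = max(x, na.rm = TRUE)
total   = sum(x, na.rm = TRUE)
middle  = median(x, na.rm = TRUE)
print(c(largest, total, middle))
```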

In the second case, we just need to remember to always use is.na whenever we are testing to see if a value is a missing value.

> is.na(x)
[1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE

By combining a call to is.na with the logical "not" operator (!) we can filter out missing values in cases where no na.rm= argument is available:

> x[!is.na(x)]
[1]  1  4  7 12 19 15 21 20
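
Because is.na returns a logical vector, it can also be used to count or locate the missing values before filtering them out. This is a minimal sketch; the names n_missing, where_missing and x_complete are hypothetical.

```r
x = c(1, 4, 7, NA, 12, 19, 15, 21, 20)

# TRUE counts as 1 in arithmetic, so the sum is the number of NAs.
n_missing = sum(is.na(x))

# which() gives the positions where the condition is TRUE.
where_missing = which(is.na(x))

# The same filter used in the text: keep only the non-missing values.
x_complete = x[!is.na(x)]

print(n_missing)
print(where_missing)
print(mean(x_complete))
```

The last line should agree with mean(x,na.rm=TRUE), since both compute the mean of the eight non-missing values.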

2 Matrices

A very common way of storing data is in a matrix, which is basically a two-way generalization of a vector. Instead of a single index, we can use two indexes, one representing a row and the second representing a column. The matrix function takes a vector and makes it into a matrix in a column-wise fashion. For example,

> mymat = matrix(1:12,4,3)
> mymat
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12

The last two arguments to matrix tell it the number of rows and columns the matrix should have. If you use a named argument, you can specify just one dimension, and R will figure out the other:

> mymat = matrix(1:12,ncol=3)
> mymat
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12

To create a matrix by rows instead of by columns, the byrow=TRUE argument can be used:

> mymat = matrix(1:12,ncol=3,byrow=TRUE)
> mymat
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
[4,]   10   11   12
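
To check what R has inferred, the dim, nrow and ncol functions report the shape of a matrix. The sketch below (with the illustrative names by_col and by_row) also shows one way to think about byrow=TRUE: filling a matrix by rows gives the same result as filling a matrix of the swapped shape by columns and then transposing it with t.

```r
# Only one dimension is given in each call; R works out the other.
by_col = matrix(1:12, nrow = 3)                # 3 x 4, filled down the columns
by_row = matrix(1:12, ncol = 3, byrow = TRUE)  # 4 x 3, filled across the rows

print(dim(by_col))
print(nrow(by_row))
print(ncol(by_row))

# Transposing the column-wise 3 x 4 matrix reproduces the row-wise 4 x 3 one.
print(all(by_row == t(by_col)))
```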

When data is being read from a file, you can simply embed a call to scan in a call to matrix. Suppose we have a file called matrix.dat with the following contents:

7 12 19 4
18 7 12 3
9 5 8 42

We could create a 3×4 matrix, read in by rows, with the following command:

matrix(scan('matrix.dat'),nrow=3,byrow=TRUE)
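
To try this without creating matrix.dat by hand, the following sketch first writes the same three lines to a temporary file and then reads them back. The temporary path and the name from_file are assumptions made for the example; quiet=TRUE just suppresses the message scan prints about how many items it read.

```r
# Write the example contents to a temporary file (a stand-in for matrix.dat).
path = tempfile(fileext = ".dat")
writeLines(c("7 12 19 4", "18 7 12 3", "9 5 8 42"), path)

# scan() returns all the numbers as one vector, in reading order;
# byrow=TRUE makes each line of the file one row of the matrix.
from_file = matrix(scan(path, quiet = TRUE), nrow = 3, byrow = TRUE)
print(from_file)

# Remove the temporary file again.
unlink(path)
```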

To access a single element of a matrix, we need to specify both the row and the column we're interested in. Consider the following matrix, containing the numbers from 1 to 10:

> m = matrix(1:10,5,2)
> m
     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10

Now suppose we want the element in row 4 and column 1:

> m[4,1]
[1] 4

If we leave out either one of the subscripts, we'll get the entire row or column of the matrix, depending on which subscript we leave out:

> m[4,]
[1] 4 9
> m[,1]
[1] 1 2 3 4 5
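
The same subscripts accept vectors and logical conditions, so we can pull out several rows at once. Note in the output above that a single row or column comes back as a plain vector; the drop=FALSE argument keeps it as a matrix. This is a small sketch with illustrative variable names.

```r
m = matrix(1:10, 5, 2)

rows_2_to_3   = m[2:3, ]               # rows 2 and 3: a 2 x 2 matrix
col_as_vector = m[, 2]                 # dimensions dropped: a plain vector
col_as_matrix = m[, 2, drop = FALSE]   # stays a 5 x 1 matrix
big_rows      = m[m[, 1] > 3, ]        # rows whose first column exceeds 3

print(rows_2_to_3)
print(dim(col_as_matrix))
print(big_rows)
```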

3 Data Frames

One shortcoming of vectors and matrices is that they can only hold one mode of data; they don't allow us to mix, say, numbers and character strings. If we try to do so, R will change the mode of the other elements in the vector to conform. For example:

> c(12,9,"dog",7,5)
[1] "12"  "9"   "dog" "7"   "5"

Notice that the numbers got changed to character values so that the vector could accommodate all the elements we passed to the c function. In R, a special object known as a data frame resolves this problem. A data frame is like a matrix in that it represents a rectangular array of data, but each column in a data frame can be of a different mode, allowing numbers, character strings and logical values to coexist in a single object in their original forms. Since most interesting data problems involve a mixture of character variables and numeric variables, data frames are usually the best way to store information in R. (It should be mentioned that if you're dealing with data of a single mode, a matrix may be more efficient than a data frame.) Data frames correspond to the traditional "observations and variables" model that most statistical software uses, and they are also similar to database tables. Each row of a data frame represents an observation; the elements in a given row represent information about that observation. Each column, taken as a whole, has all the information about a particular variable for the data set.
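
To make the contrast concrete, here is a minimal sketch with a made-up data frame called pets (the data and column names are hypothetical). We pass stringsAsFactors=FALSE explicitly so that the character column stays character regardless of the R version's default.

```r
# In a vector, mixing numbers and strings forces everything to character.
mixed_vector = c(12, 9, "dog", 7, 5)
print(mode(mixed_vector))

# In a data frame, each column keeps its own type.
pets = data.frame(name   = c("dog", "cat", "fish"),
                  weight = c(12, 9, 0.5),
                  indoor = c(FALSE, TRUE, TRUE),
                  stringsAsFactors = FALSE)

# sapply() applies class() to each column in turn.
column_classes = sapply(pets, class)
print(column_classes)
```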

For small datasets, you can enter each of the columns (variables) of your data frame using the data.frame function. For example, let's extend our temperature example by creating a data frame that has the day of the month, the minimum temperature and the maximum temperature:

> temps = data.frame(day=1:10,
+                min = c(50.7,52.8,48.6,53.0,49.9,47.9,54.1,47.6,43.6,45.5),
+                max = c(59.5,55.7,57.3,71.5,69.8,68.8,67.5,66.0,66.1,61.7))
> head(temps)
  day  min  max
1   1 50.7 59.5
2   2 52.8 55.7
3   3 48.6 57.3
4   4 53.0 71.5
5   5 49.9 69.8
6   6 47.9 68.8

Note that the names we used when we created the data frame are displayed with the data. (You can add names after the fact with the names function.) Also, instead of typing the name temps to see the data frame, we used a call to the head function. This will show us just the first six observations (by default) of the data frame, and is very handy for checking that a large data frame really looks the way you think it does. (There's a function called tail that shows the last lines in an object as well.)
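
Both head and tail accept an n= argument to change how many rows are shown, and names and dim give a quick summary of the structure. The sketch below rebuilds the temps data frame so that it runs on its own; first_three and last_two are illustrative names.

```r
temps = data.frame(day = 1:10,
                   min = c(50.7,52.8,48.6,53.0,49.9,47.9,54.1,47.6,43.6,45.5),
                   max = c(59.5,55.7,57.3,71.5,69.8,68.8,67.5,66.0,66.1,61.7))

first_three = head(temps, n = 3)   # first 3 rows instead of the default 6
last_two    = tail(temps, n = 2)   # last 2 rows

print(first_three)
print(last_two)
print(names(temps))                # the column (variable) names
print(dim(temps))                  # number of rows, number of columns
```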

Suppose we want to concentrate on the maximum daily temperature (which we've called max in our data frame) among the days recorded. There are several ways we can refer to the columns of a data frame:

Probably the easiest way to refer to this column is to use a special notation that eliminates the need to put quotes around the variable names (unless they contain blanks or other special characters). Separate the data frame name from the variable name with a dollar sign ($):
```
> temps$max
 [1] 59.5 55.7 57.3 71.5 69.8 68.8 67.5 66.0 66.1 61.7 
```
We can treat the data frame as if it were a matrix. Since the maximum temperature is in the third column, we could say
```
> temps[,3]
 [1] 59.5 55.7 57.3 71.5 69.8 68.8 67.5 66.0 66.1 61.7 
```

Since we named the columns of temps we can use a character subscript:

> temps[,"max"]
 [1] 59.5 55.7 57.3 71.5 69.8 68.8 67.5 66.0 66.1 61.7

When you use a single subscript with a data frame, it refers to a data frame consisting of just that column. R also provides a special subscripting method (double brackets) to extract the actual data (in this case a vector) from the data frame:
```
> temps['max']
    max
1  59.5
2  55.7
3  57.3
4  71.5
5  69.8
6  68.8
7  67.5
8  66.0
9  66.1
10 61.7
> temps[['max']]
[1] 59.5 55.7 57.3 71.5 69.8 68.8 67.5 66.0 66.1 61.7
```
Notice that this second form is identical to temps$max. We could also use the equivalent numerical subscript (in this case 3) with single or double brackets.
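
As a quick check, the sketch below extracts the max column in each of the ways described and compares the results with identical. The variable names are ours; the data frame is rebuilt so the example is self-contained.

```r
temps = data.frame(day = 1:10,
                   min = c(50.7,52.8,48.6,53.0,49.9,47.9,54.1,47.6,43.6,45.5),
                   max = c(59.5,55.7,57.3,71.5,69.8,68.8,67.5,66.0,66.1,61.7))

by_dollar   = temps$max        # dollar sign notation
by_position = temps[, 3]       # matrix-style, numeric column
by_name     = temps[, "max"]   # matrix-style, character column
by_double   = temps[["max"]]   # double brackets by name
by_index    = temps[[3]]       # double brackets by position
one_col_df  = temps["max"]     # single brackets: still a data frame

# The first five forms all give the same numeric vector.
print(identical(by_dollar, by_position))
print(identical(by_dollar, by_name))
print(identical(by_dollar, by_double))
print(identical(by_dollar, by_index))

# The single-bracket form is a one-column data frame, not a vector.
print(is.data.frame(one_col_df))
```
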
If you want to work with a data frame without having to constantly retype the data frame's name, you can use the with function. Suppose we want to convert our minimum and maximum temperatures to centigrade, and then calculate the difference between them. Using with, we can write:
```
> with(temps,5/9*(max-32) - 5/9*(min-32))
 [1]  4.888889  1.611111  4.833333 10.277778 11.055556 11.611111  7.444444
 [8] 10.222222 12.500000  9.000000
```
which may be more convenient than typing out the data frame name repeatedly:
```
> 5/9*(temps$max-32) - 5/9*(temps$min-32)
 [1]  4.888889  1.611111  4.833333 10.277778 11.055556 11.611111  7.444444
 [8] 10.222222 12.500000  9.000000
```
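As a check on the arithmetic, note that the two offsets of 32 cancel when we subtract, so the difference in centigrade is simply 5/9 of the difference in the original units. The sketch below compares the two forms with all.equal, which allows for tiny floating-point differences; the names long_form and short_form are illustrative.

```r
temps = data.frame(day = 1:10,
                   min = c(50.7,52.8,48.6,53.0,49.9,47.9,54.1,47.6,43.6,45.5),
                   max = c(59.5,55.7,57.3,71.5,69.8,68.8,67.5,66.0,66.1,61.7))

# Convert each temperature, then subtract (as in the text).
long_form  = with(temps, 5/9*(max-32) - 5/9*(min-32))

# Subtract first, then rescale: the 32s cancel.
short_form = with(temps, 5/9*(max - min))

print(all.equal(long_form, short_form))
```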
Finally, if the goal is to add one or more new columns to a data frame, you can combine a few operations into one using the transform function. The first argument to transform is the name of the data frame that will be used to construct the new columns. The remaining arguments to transform are name/value pairs describing the new columns. For example, suppose we wanted to create a new variable in the temps data frame called range, representing the difference between the min and max values for each day. We could use transform as follows:
```
> temps = transform(temps,range = max - min)
> head(temps)
    day  min  max range
  1   1 50.7 59.5   8.8
  2   2 52.8 55.7   2.9
  3   3 48.6 57.3   8.7
  4   4 53.0 71.5  18.5
  5   5 49.9 69.8  19.9
  6   6 47.9 68.8  20.9
```
As can be seen, transform returns a new data frame like the original one, but with one or more new columns added.
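
Several columns can be added in one call by supplying more name/value pairs. In the sketch below, midpoint is a hypothetical second column added only for illustration; the last lines show the equivalent assignment with the dollar sign notation for comparison.

```r
temps = data.frame(day = 1:10,
                   min = c(50.7,52.8,48.6,53.0,49.9,47.9,54.1,47.6,43.6,45.5),
                   max = c(59.5,55.7,57.3,71.5,69.8,68.8,67.5,66.0,66.1,61.7))

# Two new columns in a single call to transform.
temps2 = transform(temps,
                   range    = max - min,
                   midpoint = (min + max) / 2)
print(head(temps2, n = 3))

# The same range column, added by direct assignment instead.
temps3 = temps
temps3$range = temps3$max - temps3$min
print(all.equal(temps2$range, temps3$range))
```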

File translated from TeX by TtH, version 3.67.
On 30 Apr 2010, 16:31.